Exploratory data analysis of spatial sailing data in R

What can we learn from 230 dinghy miles?
(Alternative title: That’s how a sailing data scientist spends his time in autumn!)

Sandro Raabe

6th November, 2020

Management Summary

This is a case study on exploratory data analysis of manually collected GPS data. Using data wrangling packages (XML, tidyverse) and visualization tools (ggplot2, highcharter, leaflet), we explore relationships in the data. These relationships are not investigated further through hypothesis tests.

Introduction

In spring 2020 I joined a local dinghy sailing club on the Alster lake in Hamburg, Germany. This summer I used their boats extensively and built up my dinghy sailing skills. I recorded some of these trips with GPS and examined them with various visualization tools. These are some of my insights:

  • Apparently I don’t like sailing on Wednesdays and Thursday is Alster exploration day.
  • With the boat type Möwe one apparently prefers to stay close to the mooring.
  • The Kielzugvogel should take part in the German Sailing League.
  • Corona enhances solo sailing skills.
  • GPS tracks of Regatta races look like balls of wool.
  • The center of the Alster is (as expected) the sailing hotspot.

Data overview

Since I did not record all sessions (of course), we start with an overview of the data used in this analysis:

  • Period: Sunday, 17.05.2020 to Friday, 06.11.2020
  • Number of recorded GPS points: 92,465
  • Number of recorded days: 46
  • Total distance recorded: 233 nautical miles, which is approximately 432 km
  • Number of used boat types: 10
  • Number of different sailing partners: 16
  • 46 sessions in 24.7 weeks yield 1.9 sessions per average week, or one session every 3.8 days
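
The session rate in the last bullet follows directly from the period and the number of sessions. A minimal sketch in base R, using only the figures listed above (the variable names are my own choice):

```r
# Period and session count as stated in the data overview.
first_day  <- as.Date("2020-05-17")
last_day   <- as.Date("2020-11-06")
n_sessions <- 46

# Length of the period in days and weeks.
n_days  <- as.numeric(last_day - first_day)
n_weeks <- n_days / 7

sessions_per_week <- n_sessions / n_weeks
days_per_session  <- n_days / n_sessions

cat(sprintf("%.1f weeks, %.1f sessions per week, one session every %.1f days\n",
            n_weeks, sessions_per_week, days_per_session))
# prints: 24.7 weeks, 1.9 sessions per week, one session every 3.8 days
```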

Calendar View

The days in the data set can be displayed without aggregation:

Observation: September was the most active sailing month (14 training sessions).

Days of the week

Now we can split the dataset according to the contained variables, beginning with the weekday.
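
Counting sessions per weekday could look roughly like the following base R sketch. The dates are made up for illustration and are not the recorded sessions; the variable names are assumptions as well.

```r
# Made-up session dates, for illustration only (not the recorded data).
session_dates <- as.Date(c("2020-09-01", "2020-09-03", "2020-09-05",
                           "2020-09-08", "2020-09-10"))

# "%u" gives the ISO weekday number (1 = Monday ... 7 = Sunday) and,
# unlike weekdays(), does not depend on the system locale.
weekday_no <- as.integer(format(session_dates, "%u"))

weekday_labels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
weekday <- factor(weekday_labels[weekday_no], levels = weekday_labels)

# table() on a factor also reports weekdays with zero sessions.
weekday_counts <- table(weekday)
print(weekday_counts)
```

Keeping the weekday as a factor with all seven levels means that a day without any sessions still shows up with a count of zero.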

Frequency distribution

Observation: On Wednesdays, the big Alster race (the kangaroo regatta) seems to spoil my sailing, but I seem to like participating in the club’s own Tuesday regatta. Alternatively: Am I so tired after our Tuesday regatta that I would rather stay home on Wednesdays?

Spatial distribution

We use the Leaflet package for interactive visualisation of all GPS tracks (use the mouse wheel or the soft buttons for zooming, and the legend to switch between weekdays):

Observation: On Saturdays I like to stay close to the mooring, whereas on Thursdays I sail all the way to the university. The regatta race tracks in the shape of triangles are clearly visible (using zoom).

Headings and wind direction

For every GPS point we know the current heading (a.k.a. the “driving direction”). We can count these and visualise them as a histogram (frequency diagram) laid out like a compass. Additionally, we draw the wind directions.
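
Counting headings per compass sector might be sketched as follows in base R. The heading values and the 30° sector width are assumptions for illustration; the bin width actually used in the chart is not stated here.

```r
# Made-up headings in degrees (0 = North, 90 = East), for illustration only.
heading_deg <- c(12, 17, 14, 163, 168, 171, 95, 350, 8, 200)

sector_width <- 30  # assumed bin width in degrees

# Wrap into [0, 360) and bin; right = FALSE gives bins [0,30), [30,60), ...
heading_wrapped <- heading_deg %% 360
breaks <- seq(0, 360, by = sector_width)
sector <- cut(heading_wrapped, breaks = breaks, right = FALSE)

# Number of GPS points per compass sector.
heading_counts <- table(sector)
print(heading_counts)
```

In ggplot2, counts like these could be drawn as a bar chart and bent into a compass shape with coord_polar().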

The headings around 15° (North-Northeast) and 165° (South-Southeast) seem to be rather popular with me. This is unsurprising considering the geographical shape of the Alster lake: With its slim North-South orientation one goes more “up and down” than “left and right”. Furthermore, one can clearly see the main wind directions for Hamburg’s geographical location: Southwest and Northwest.

Observation: The usual start from the OSG mooring is towards North-Northeast. Furthermore, we often have Southwesterly and Northwesterly wind, and since a boat cannot sail straight into the wind, exactly those headings should be rare.

Boats

Frequency

With 14 sessions, the Conger was my favourite boat. There are two reasons for this: On the one hand, it’s a very beginner-friendly boat (I only did my certification in autumn 2019 and had never touched a sailboat before) - the other reason you can find in the analysis of sailing partners later in this article.

Observation: Conger boats are great for learning how to sail.

Spatial distribution

Observation: Conger and Kielzugvogel can get you anywhere, but with the Möwe you had better stay close to the mooring.

Which boat is the fastest?

An interesting question arises: Which boat performs best, and what is performance? Is it top speed, regardless of wind conditions? Is it speed compared to wind speed? To remove the influence of the different wind speeds of the individual sessions, we express the measured boat speed relative to the wind speed. This makes the speed dimensionless, as speed in % of wind speed - 100% means that we went as fast as the wind, 50% half as fast, et cetera.
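
The normalisation described above can be sketched in a few lines of base R. The data frame, its column names, the boat labels and the use of knots are assumptions for illustration; the numbers are made up and are not measurements from this analysis.

```r
# Made-up values for illustration only; the column names are assumptions.
tracks <- data.frame(
  boat          = c("A", "A", "B", "B"),
  boat_speed_kn = c(4.0, 6.0, 3.0, 5.0),   # measured boat speed in knots
  wind_speed_kn = c(8.0, 8.0, 12.0, 12.0)  # wind speed of the session in knots
)

# Dimensionless speed: boat speed as a percentage of wind speed.
tracks$rel_speed_pct <- 100 * tracks$boat_speed_kn / tracks$wind_speed_kn

# Average and top relative speed per boat.
mean_rel <- tapply(tracks$rel_speed_pct, tracks$boat, mean)
max_rel  <- tapply(tracks$rel_speed_pct, tracks$boat, max)
print(round(rbind(mean = mean_rel, max = max_rel), 1))
```

Because boat speed and wind speed enter as a ratio, the result does not depend on the unit, as long as both are given in the same one.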

Interesting to note: The small Conger was almost twice as fast as the wind at its top speed, but the racing dinghies 470er and Laser were only half as fast as the wind on average. These only perform well at high wind speeds; others don’t need as much wind, especially if they are lightweight and have large sails (which is the case with the two winners).

Observation: Kielzugvogel and J70 (the boat used in the Bundesliga/German Sailing League) get the most power out of any given wind - perhaps the Kielzugvogel should participate in Bundesliga races?

Sailing Partners

Frequency

Observation: With 11 sessions I was most often alone. This was mainly due to the COVID-19 restrictions in May and June 2020 - I had no choice but to learn single-handed sailing.

Spatial distribution of sailing partners

We use a static (i.e. not interactive) visualisation of the GPS tracks:

Observation: Clearly visible are racing partners Christoph, Bernd and Jochem with their tracks shaped like balls of wool.

Favourite Alster Regions

We consider the two-dimensional density function of all sailed tracks. In simple words: We divide the Alster lake into small rectangles and count how often we cross each rectangle during our sailing sessions. Then we can colour the rectangles from green to red according to their frequency - like a COVID-19 hotspot map!
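
The counting idea can be sketched in base R as follows. The coordinates, the grid extent and the 2 x 2 grid are made up for illustration. Note that this simplified version counts recorded GPS points per rectangle rather than track crossings, and it is not necessarily how the map in this article was computed.

```r
# Made-up GPS points in decimal degrees, for illustration only.
lon <- c(10.001, 10.002, 10.002, 10.006, 10.009, 10.0021)
lat <- c(53.561, 53.562, 53.562, 53.566, 53.569, 53.5621)

n_cells <- 2  # assumed number of rectangles per axis

# Rectangle edges over an assumed bounding box.
lon_breaks <- seq(10.00, 10.01, length.out = n_cells + 1)
lat_breaks <- seq(53.56, 53.57, length.out = n_cells + 1)

# Assign every point to a rectangle along each axis.
lon_cell <- cut(lon, breaks = lon_breaks, include.lowest = TRUE)
lat_cell <- cut(lat, breaks = lat_breaks, include.lowest = TRUE)

# Cross-tabulate: one count per rectangle, empty rectangles included.
cell_counts <- table(lat_cell, lon_cell)
print(cell_counts)
```

In ggplot2, geom_bin2d() performs this kind of rectangular binning directly, and stat_density_2d() gives a smoothed variant.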

Observation: The red hotspots could be the club’s Tuesday race, and the mooring on the lower left is clearly visible.

Further Research

This is just a small, visualisation-driven exploratory data analysis of the 2020 season, mostly univariate and without testing the hypotheses found. Interesting questions arose:

  • Can I see the correlation between boat length and boat speed in my data?
  • Can I leverage the yardstick system for further research on boat type vs. speed?
  • Besides the 233 nautical miles on the Alster lake I also completed 280 nm on the Baltic Sea in 2020. I didn’t include these here, but that would be an interesting analysis, too.

These questions could be examined using hypothesis tests or machine learning methods, which would be out of scope here. The computer chips in the basement are, however, ready and I still need something to do in 2021 :-)

Credits

Thanks to OSG for the great community, the crazy boats and the fun things we do together!

The technical stuff

Data collection was carried out using the Komoot and Waterspeed mobile apps. The analysis was conducted using R 4.0 and the following useful helpers:

  • Data Input:
    • readr
    • readxl
  • Data Wrangling:
    • dplyr
    • purrr
    • tidyr
    • lubridate
    • glue
  • Graphics:
    • highcharter
    • ggplot2
    • randomcoloR
    • yarrr
  • Spatial Analysis:
  • Output:
    • rmarkdown
    • knitr
    • prettydoc

The code for the calculations and visualisations can be downloaded from my GitHub repository: https://github.com/shosaco/sailing_analyses. This page is available at https://shosaco.github.io/sailing_analyses.